feat(ingest/kafka-connect): Kafka connect infer lineage from DataHub #15234
base: master
Conversation
❌ 1 Tests Failed:
View the top 1 failed test(s) by shortest run time
To view more test analytics, go to the Test Analytics Dashboard.
impressive docs!
```python
# Schema resolver configuration for enhanced lineage
use_schema_resolver: bool = Field(
    default=False,
    description="Use DataHub's schema metadata to enhance CDC connector lineage. "
```
I would avoid introducing CDC as a new term to refer to Kafka Connect
CDC connector --> Kafka Connector
CDC sources/sinks -> Kafka Connect sources/sinks
```
It's used when connectors don't have table.include.list configured, meaning they
capture ALL tables from the database.
The method first tries to use cached URNs from SchemaResolver (populated from
```
populated from previous ingestion runs? 🤔
```python
# Use graph.get_urns_by_filter() to get all datasets for this platform
# This is more efficient than a search query and uses the proper filtering API
all_urns = set(
    self.schema_resolver.graph.get_urns_by_filter(
```
We use the schema resolver here

```python
all_urns = self.schema_resolver.get_urns()
```

* Is this SchemaResolver instance instantiated with the platform and env of the source database? Anyway, either the resolver is populated with a search such as the one below, or it will be empty.

and here

```python
self.schema_resolver.graph.get_urns_by_filter(
    entity_types=["dataset"],
    platform=platform,
    platform_instance=self.schema_resolver.platform_instance,
    env=self.schema_resolver.env,
)
```

* here we just use the `graph` object, not actually using the SchemaResolver itself

I have the feeling that the usage of the SchemaResolver is very residual and only amounts to using its `DataHubGraph` object for doing the search.
Instead, I think it would be better to move some of this pattern+search logic down into the SchemaResolver itself. Or alternatively, just use the `DataHubGraph` object directly instead of taking a dependency on the SchemaResolver.
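For illustration, if only the lookup is needed, a small helper over the `DataHubGraph` client along these lines (a sketch reusing the same `get_urns_by_filter` call quoted above; the helper name is made up) would drop the SchemaResolver dependency entirely:

```python
# Sketch: query DataHub directly through the DataHubGraph client instead of
# reaching into SchemaResolver just for its graph handle.
from typing import Optional, Set

from datahub.ingestion.graph.client import DataHubGraph


def discover_platform_dataset_urns(
    graph: DataHubGraph,
    platform: str,
    platform_instance: Optional[str],
    env: str,
) -> Set[str]:
    return set(
        graph.get_urns_by_filter(
            entity_types=["dataset"],
            platform=platform,
            platform_instance=platform_instance,
            env=env,
        )
    )
```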
```python
# Filter by platform
if f"dataPlatform:{platform}" not in urn:
    continue
```
this shouldn't be necessary with the get_urns_by_filter search, no?
```python
# Filter by database - check if table_name starts with database prefix
if database_name:
    if table_name.lower().startswith(f"{database_name.lower()}."):
        # Remove database prefix to get "schema.table"
        schema_table = table_name[len(database_name) + 1 :]
        discovered_tables.append(schema_table)
```
All this parsing of the URN... I find it fragile.
Couldn't we make a more specific search, e.g. search for datasets in a given database (which should be a container)?
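For what it's worth, a container-scoped lookup could look roughly like the sketch below. This assumes the database container URN is already known (the `database_container_urn` variable is hypothetical) and that the SDK's `get_urns_by_filter` accepts a `container` filter in the version used here; if not, the same idea could go through a custom search filter instead.

```python
# Hypothetical sketch: scope the dataset search to the database's container
# instead of string-matching a database prefix inside each URN.
database_container_urn = "urn:li:container:..."  # assumed to be resolved elsewhere

discovered_urns = list(
    self.schema_resolver.graph.get_urns_by_filter(
        entity_types=["dataset"],
        platform=platform,
        platform_instance=self.schema_resolver.platform_instance,
        env=self.schema_resolver.env,
        container=database_container_urn,  # assumption: container filtering is supported
    )
)
```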
```python
# Build target URN using DatasetUrn helper with correct target platform
target_urn = DatasetUrn.create_from_ids(
    platform_id=target_platform,
```
platform instance?
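A possible shape for carrying the instance through, assuming `DatasetUrn.create_from_ids` accepts a `platform_instance` argument in the SDK version used (otherwise `make_dataset_urn_with_platform_instance` from `datahub.emitter.mce_builder` covers the same case). The `target_*` and `env` variables are placeholders standing in for values from the surrounding code, and `DatasetUrn` is already imported in this file per the quoted diff:

```python
# Sketch: include the configured platform instance when building the target URN
# (mirrors the quoted call; assumes create_from_ids accepts platform_instance).
target_urn = DatasetUrn.create_from_ids(
    platform_id=target_platform,
    table_name=target_table,                      # placeholder: table name for the sink
    env=env,                                      # placeholder: environment from config
    platform_instance=target_platform_instance,  # placeholder: instance for the sink platform
)
```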
```python
try:
    # Get all URNs from schema resolver and filter for the source platform
    # The cache may contain URNs from other platforms if shared across runs
```
what do you mean "shared across runs"?
```python
if self.schema_resolver and self.schema_resolver.graph:
    logger.info(
        f"Kafka API unavailable for connector '{self.connector_manifest.name}' - "
        f"querying DataHub for Kafka topics to expand pattern '{topics_regex}'"
    )
    try:
        # Query DataHub for all Kafka topics
        kafka_topic_urns = list(
            self.schema_resolver.graph.get_urns_by_filter(
                platform="kafka",
                env=self.schema_resolver.env,
                entity_types=["dataset"],
            )
        )

        datahub_topics = []
        for urn in kafka_topic_urns:
            topic_name = self._extract_table_name_from_urn(urn)
            if topic_name:
                datahub_topics.append(topic_name)

        matched_topics = matcher.filter_matches([topics_regex], datahub_topics)

        logger.info(
            f"Found {len(matched_topics)} Kafka topics in DataHub matching pattern '{topics_regex}' "
            f"(out of {len(datahub_topics)} total Kafka topics)"
        )
        return matched_topics
```
We claim here that we're using the SchemaResolver, but we are just using the graph object and we are not even updating the SchemaResolver cache afterwards.
As per my understanding, we should refresh the SchemaResolver cache with the results of the search.
And ideally, we could move some of this logic down into the SchemaResolver class.
In SchemaResolver we have

```python
def resolve_table(self, table: _TableName) -> Tuple[str, Optional[SchemaInfo]]: ...
```

maybe we could add some new method to resolve:
- URNs and schemas for a given database
- or URNs and schemas for a given regexp pattern

I think pushing this kind of functionality down to the SchemaResolver would simplify code in the source and help achieve a better separation of responsibilities.
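To make the suggestion concrete, a new method could look roughly like this. The name, signature, and URN handling are hypothetical, not existing SchemaResolver API; it only builds on the `get_urns()` accessor mentioned above:

```python
import re
from typing import List, Optional


class SchemaResolver:  # existing class; only the new method is sketched here
    def resolve_urns_by_pattern(
        self, pattern: str, database: Optional[str] = None
    ) -> List[str]:
        """Return cached dataset URNs whose table name matches `pattern`,
        optionally restricted to tables under `database`."""
        compiled = re.compile(pattern, re.IGNORECASE)
        matched: List[str] = []
        for urn in self.get_urns():  # cache populated from DataHub
            # Dataset URNs look like: urn:li:dataset:(urn:li:dataPlatform:<p>,<name>,<env>)
            parts = urn.split(",")
            if len(parts) < 2:
                continue
            name = parts[1]
            if database and not name.lower().startswith(f"{database.lower()}."):
                continue
            if compiled.search(name):
                matched.append(urn)
        return matched
```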
```python
from datahub.sql_parsing.schema_resolver import SchemaResolver

# Get platform from connector instance (single source of truth)
platform = connector.get_platform()

# Get platform instance if configured
platform_instance = get_platform_instance(
    config, connector.connector_manifest.name, platform
)

logger.info(
    f"Creating SchemaResolver for connector {connector.connector_manifest.name} "
    f"with platform={platform}, platform_instance={platform_instance}"
)

return SchemaResolver(
    platform=platform,
    platform_instance=platform_instance,
    env=config.env,
    graph=ctx.graph,
)
```
In bigquery, we do

```python
return self.ctx.graph.initialize_schema_resolver_from_datahub(
    platform=self.platform,
    platform_instance=self.config.platform_instance,
    env=self.config.env,
    batch_size=self.config.schema_resolution_batch_size,
)
```

which instantiates the SchemaResolver and populates its cache.
That would make sense here too, right? Have you considered it?
good abstractions here! ❤️
```python
)


class TestParseCommaSeparatedList:
```
nice tests covering all cases
wondering if parse_comma_separated_list may be moved to some utils class
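If it were extracted, the shared helper might look roughly like this (location and exact semantics are assumptions based on the name and the tests quoted below):

```python
# Hypothetical shared utility (sketch): split a comma-separated config value
# into trimmed, non-empty items.
from typing import List


def parse_comma_separated_list(value: str) -> List[str]:
    if not value:
        return []
    return [item.strip() for item in value.split(",") if item.strip()]
```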
```python
        assert parse_comma_separated_list(input_str) == items


class TestConnectorConfigKeys:
```
wondering about the value of testing constants 😅
sgomezvillamor left a comment
While I haven't reviewed all files in detail, overall this looks pretty good.
I haven't been able to check coverage in Codecov; it may well be high given the number of tests.
My concern is the usage of the SchemaResolver. We claim to use it even in the user docs, and this is also reflected in the configs; however, the usage mostly amounts to using the graph object inside the SchemaResolver:
- we never refresh (or I missed it!) the internal cache of SchemaResolver (the caching would be one of the reasons to use SchemaResolver)
- and we still do a lot of "resolution" logic outside of the SchemaResolver
Summary
This PR adds automatic lineage inference from DataHub to the Kafka Connect source connector. Instead of relying solely on connector manifests, the ingestion can now query DataHub's metadata graph to resolve schemas and generate both table-level and column-level lineage.
Motivation
Currently, Kafka Connect lineage extraction is limited by what's explicitly declared in connector configurations. This PR enables:
- Patterns like `table.include.list: "database.*"` can now be resolved to actual table names by querying DataHub

Changes
New Configuration Options
Added three new configuration fields to `KafkaConnectSourceConfig`, including `use_schema_resolver`.

Auto-Enable for Confluent Cloud

New behavior: `use_schema_resolver` is automatically enabled when Confluent Cloud is detected via:
- `confluent_cloud_environment_id` + `confluent_cloud_cluster_id` configuration
- Connect URI pointing to Confluent Cloud (`api.confluent.cloud/connect/v1/`)

Users can opt out by explicitly setting `use_schema_resolver: false`.

Core Components
- SchemaResolver Integration (`connector_registry.py`): `create_schema_resolver()` method to instantiate schema resolvers with platform-specific configurations
- Fine-Grained Lineage Extraction (`common.py`): `_extract_fine_grained_lineage()` method in `BaseConnector`, emitting `FineGrainedLineageClass` instances for column-level lineage (see the sketch after this list)
- Enhanced Source Connectors (`source_connectors.py`): table pattern resolution (e.g. `ANALYTICS.PUBLIC.*` → actual tables)
- Pattern Matching (`pattern_matchers.py`): matching for table patterns (`database.*`, `schema.table*`, etc.)
- Configuration Constants (`config_constants.py`)
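As a rough illustration of the column-level lineage bullet above (a hedged sketch, not the PR's actual `_extract_fine_grained_lineage()` implementation; URNs and field names are placeholders):

```python
# Sketch: column-level lineage between a source table and its Kafka topic,
# expressed as FineGrainedLineageClass entries for the upstream lineage aspect.
from datahub.emitter.mce_builder import make_schema_field_urn
from datahub.metadata.schema_classes import (
    FineGrainedLineageClass,
    FineGrainedLineageDownstreamTypeClass,
    FineGrainedLineageUpstreamTypeClass,
)

source_urn = "urn:li:dataset:(urn:li:dataPlatform:mysql,db.schema.table,PROD)"  # placeholder
topic_urn = "urn:li:dataset:(urn:li:dataPlatform:kafka,db.schema.table,PROD)"   # placeholder
shared_fields = ["id", "name"]  # placeholder: would come from the resolved schema

fine_grained_lineages = [
    FineGrainedLineageClass(
        upstreamType=FineGrainedLineageUpstreamTypeClass.FIELD_SET,
        upstreams=[make_schema_field_urn(source_urn, field)],
        downstreamType=FineGrainedLineageDownstreamTypeClass.FIELD,
        downstreams=[make_schema_field_urn(topic_urn, field)],
    )
    for field in shared_fields
]
```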
Improved Topic Handling:
Code Quality Improvements
Usage
OSS Kafka Connect (Default Behavior)
Confluent Cloud (Auto-Enabled)
Confluent Cloud (Opt-Out)
OSS with Schema Resolver (Explicit Enable)
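The original recipe examples are not reproduced here; as a hedged sketch, the difference between the modes boils down to the fields below (endpoints and IDs are placeholders, shown programmatically via the ingestion `Pipeline` API):

```python
# Sketch: kafka-connect recipes for the two main modes, expressed as config dicts.
from datahub.ingestion.run.pipeline import Pipeline

# Confluent Cloud: use_schema_resolver is auto-enabled; uncomment the override to opt out.
confluent_cloud_recipe = {
    "source": {
        "type": "kafka-connect",
        "config": {
            "connect_uri": "https://api.confluent.cloud/connect/v1/",  # placeholder
            "confluent_cloud_environment_id": "env-xxxxx",  # placeholder
            "confluent_cloud_cluster_id": "lkc-xxxxx",  # placeholder
            # "use_schema_resolver": False,  # opt out of the auto-enable
        },
    },
    "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
}

# OSS Kafka Connect: disabled by default; enable explicitly to query DataHub.
oss_recipe = {
    "source": {
        "type": "kafka-connect",
        "config": {
            "connect_uri": "http://localhost:8083",
            "use_schema_resolver": True,
        },
    },
    "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
}

Pipeline.create(oss_recipe).run()
```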
Testing
Test modules:
- `test_kafka_connect.py` - Core connector tests
- `test_kafka_connect_config_validation.py` - Auto-enable logic tests (8 new tests)
- `test_kafka_connect_schema_resolver.py` - Schema resolver integration
- `test_kafka_connect_snowflake_source.py` - Snowflake connector tests
- `test_kafka_connect_pattern_matchers.py` - Pattern matching tests
- `test_kafka_connect_config_constants.py` - Configuration validation
- `test_kafka_connect_connector_registry.py` - Connector registration tests

Documentation
Updated `docs/sources/kafka-connect/kafka-connect.md` with documentation for the new schema resolver configuration and usage.

Breaking Changes
None - All features are opt-in (or auto-enabled only for Confluent Cloud). Existing Kafka Connect ingestions continue to work unchanged.
Default behavior:
- OSS Kafka Connect: `use_schema_resolver: false` (unchanged behavior)
- Confluent Cloud: `use_schema_resolver: true` (auto-enabled, can be disabled)

Prerequisites for Schema Resolver
IMPORTANT: For schema resolver to work, source database tables must be ingested into DataHub before running Kafka Connect ingestion. Without prior database ingestion, schema resolver will not find table metadata.
Recommended ingestion order:
1. Ingest the source database platforms (e.g. Snowflake) into DataHub so table metadata exists
2. Run the Kafka Connect ingestion with the schema resolver enabled
🤖 Generated with Claude Code
Co-Authored-By: Claude [email protected]